Abstract

Most machine learning tools work with a single table
where each row is an instance and each column
is an attribute. Each cell of the table contains an
attribute value for an instance. This representation
prevents one important form of learning: classification
based on groups of correlated records, such as
multiple exams of a single patient, internet
customer preferences, weather forecasts or predictions
of sea conditions for a given day. To some
extent, relational learning methods, such as inductive
logic programming, can capture this correlation
through the use of intensional predicates added
to the background knowledge. In this work, we propose
SPPAM, an algorithm that aggregates past observations
into one single record. We show that applying
SPPAM to the original correlated data, before
the learning task, can produce classifiers that
are better than the ones trained using all records.

Keywords: multi-relational data, classification, data prepro- 
cessing. 

1 Introduction 

Machine learning techniques have been successfully applied
to various domains. However, there is a lack of formal
methodology and application of machine learning tools to
datasets that are characterized by subgroups of correlated
records. Examples are medical records with multiple exams
of a single patient, internet customer preferences, weather
forecasts and predictions of sea conditions, among others.
Despite the many applications that fall into this category,
there is also a lack of available datasets with this
characteristic in the main UCI machine learning repository
(http://archive.ics.uci.edu/ml/).

Machine learning tools usually learn classifiers from a single
table where each row is an instance and each column is an
attribute. Each cell of the table contains an attribute value for
an instance. Most of these tools treat the rows of this table as
independent of each other, which prevents one important
form of learning based on groups of correlated records. Tools
based on first-order logic (inductive logic programming) can
partially overcome this problem because they can do
multi-relational learning. However, first-order rules in the form
of intensional predicates need to be added to the background
knowledge in order to encode the multi-relational meaning
intended by the observer [1].

When dealing with data that have this multi-relational
characteristic, one additional problem arises when using
cross-validation. Ideally, records that belong to the same
observation period need to be manually separated so that
all records of a given period fall into just one fold.
Machine learning tools like WEKA [2], for example, do not
allow training based on pre-defined folds (except when using
the percentage split training option).
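The fold constraint described above can be illustrated with a short sketch. It assumes each record carries a group identifier (for instance, the observation date); the function name `grouped_folds` and the greedy balancing strategy are illustrative choices, not part of WEKA or of the method proposed here, and the sketch does not stratify by class.

```python
from collections import defaultdict


def grouped_folds(group_ids, k):
    """Assign record indices to k folds so that no group is split.

    group_ids[i] is the observation period (or location) of record i.
    Returns a list of k lists of record indices.
    """
    # Collect the record indices that belong to each group.
    members = defaultdict(list)
    for index, group in enumerate(group_ids):
        members[group].append(index)

    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold, so fold sizes stay roughly equal.
    folds = [[] for _ in range(k)]
    for group in sorted(members, key=lambda g: -len(members[g])):
        smallest = min(folds, key=len)
        smallest.extend(members[group])
    return folds


# Hypothetical example: three days with 4 records each, one with 2.
ids = ["d1"] * 4 + ["d2"] * 4 + ["d3"] * 4 + ["d4"] * 2
folds = grouped_folds(ids, 2)
```

Every record of a given day ends up in exactly one fold, so no day contributes to both the training and the test portion of a split.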

In this work, we propose a general method that connects 
records that are correlated (either through the same location 
or observation period). To the best of our knowledge, this is 
the first work that tackles this problem. 

We propose SPPAM (Statistical Preprocessing AlgorithM),
an algorithm that aggregates past correlated observations
into one single record. We apply SPPAM to two datasets
of surf conditions. Our task is to learn a classifier that
accurately predicts whether a given beach is suitable for
surfing on a given day. We perform our experiments using the
WEKA machine learning tool and compare the performance of
various WEKA algorithms trained on the original datasets and
on the SPPAM-transformed datasets. We show that applying
SPPAM to the original correlated data, before the learning
task, can produce classifiers that are better than the ones
trained using the original datasets with all records.

Some work has been done on characterizing relations from
weather observations and forecasts using machine learning
techniques. For example, Ingsrisawan et al. used support
vector machines, decision trees and neural networks to develop
models to predict rainfall occurrences in Thailand [4]. Lai
et al. proposed a preprocessing technique for weather data
in order to predict temperature and weather conditions [5].
Williams et al. proposed the use of Random Forests to predict
and classify storm forecasts [3]. However, we are interested
neither exclusively in temporal patterns nor in pure weather
forecasting. Our goal is to provide a generic preprocessing
technique that increases classification performance for every
suitable dataset.

This paper is organized as follows. In Section 2 we
describe the SPPAM algorithm. In Section 3 we discuss the
methodology used to run our experiments and present the
datasets. In Section 4 we present and discuss our results.
Finally, in Section 5 we draw our conclusions and give
perspectives of future work.

2 An approach to classify multiple correlated 
data 

SPPAM is a two-step algorithm that captures the hierarchi- 
cal aspect of learning from a dataset with multiple records for 
the same observation period (or location). The first step is 
to separate and consolidate records that belong to the same 
time/location interval. The user needs to provide the name 
of the attribute that will be used to perform this separation 
and the name of the class attribute. We also assume that each 
record has a unique identifier. We use a transformation that 
maps several records into just one record along with a trans- 
formation on the original attributes. The algorithm can be 
seen in Algorithm 1.

Data:
    Dataset  // original dataset
Result:
    Out      // original dataset transformed

Initialize a new empty dataset Out;
Read Dataset;
foreach Attribute a in Dataset do
    if a has type Numeric then
        create Attributes a-Maximum, a-Minimum,
        a-Average and a-Last on Out;
    else if a is Nominal then
        create Attributes a-Frequency for each nominal
        value and a-Last on Out;
    else
        copy a to Out;
    end
end
Group correlated records according to the user provided
field;
foreach Group i do
    read each individual attribute value A;
    if A has type String or is the ID then
        copy A to Out;
    end
    if A has type numeric then
        calculate Maximum, Minimum, Average and
        Last values among all values of A for group i;
        copy them to Out;
    end
    if A is nominal then
        copy frequency and the last value of A in group i
        to Out;
    end
    Take the value of the class variable of last instance of
    Group i;
    Copy it to Out to complete the record
end

Algorithm 1: The SPPAM algorithm



This basic version of the algorithm maps groups of records 
of each observation to just one record by computing aggre- 
gates for the values of the attributes. But what to do with the 
class variable? In this algorithm, we keep the last class value 
of the group (i.e., the most recent observation). 
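As a minimal sketch of the aggregation step, the following Python function follows the logic of Algorithm 1 under some simplifying assumptions: records are dictionaries already ordered chronologically within each group, the caller declares which attributes are numeric and which are nominal, and String attributes other than the grouping attribute are ignored. The function name `sppam`, its signature and the patient data are hypothetical; the attribute suffixes mirror the naming used in Figure 2.

```python
from collections import Counter


def sppam(rows, group_key, class_key, numeric, nominal):
    """Aggregate rows sharing the same group_key value into one record.

    rows    : list of dicts, in chronological order within each group
    numeric : list of numeric attribute names
    nominal : dict mapping a nominal attribute name to its list of values
    """
    # Group rows by the user-provided attribute, keeping input order.
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row)

    out = []
    for key, members in groups.items():
        record = {group_key: key}
        for name in numeric:
            values = [float(m[name]) for m in members]
            record[name + "_MAX"] = max(values)
            record[name + "_MIN"] = min(values)
            record[name + "_AVG"] = sum(values) / len(values)
            record[name + "_LAST"] = values[-1]
        for name, domain in nominal.items():
            counts = Counter(m[name] for m in members)
            for value in domain:
                # Frequency as a percentage of the records in the group.
                record[f"{name}_{value}_PERC"] = 100.0 * counts[value] / len(members)
            record[name + "_LAST"] = members[-1][name]
        # The aggregated record keeps the most recent class value.
        record[class_key] = members[-1][class_key]
        out.append(record)
    return out


# Hypothetical data: two exams for patient p1, one exam for patient p2.
rows = [
    {"patient": "p1", "glucose": 90, "result": "neg", "sick": 0},
    {"patient": "p1", "glucose": 130, "result": "pos", "sick": 1},
    {"patient": "p2", "glucose": 100, "result": "neg", "sick": 0},
]
out = sppam(rows, "patient", "sick", ["glucose"], {"result": ["neg", "pos"]})
```

Each output record holds one entry for the grouping attribute, four per numeric attribute, one per nominal value plus the last nominal value, and the class.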

Figures 1 and 2 illustrate an example of this transformation.

@ATTRIBUTE Date String
@ATTRIBUTE Wind_Knots numeric
@ATTRIBUTE Wind_Dir {N, NE, E, SE, S, SW, W, NW}
@ATTRIBUTE Surf {0,1}
@DATA
18-11-2010,15.6,SE,0
18-11-2010,9.7,SE,0
18-11-2010,3.9,SE,0
18-11-2010,5.8,NE,0
19-11-2010,11.7,NE,0
19-11-2010,15.6,NE,0
19-11-2010,13.6,E,1
19-11-2010,15.6,E,1

Figure 1: Original dataset

@ATTRIBUTE Date STRING
@ATTRIBUTE Wind_Knots_MAX NUMERIC
@ATTRIBUTE Wind_Knots_MIN NUMERIC
@ATTRIBUTE Wind_Knots_AVG NUMERIC
@ATTRIBUTE Wind_Knots_LAST NUMERIC
@ATTRIBUTE Wind_Dir_N_PERC NUMERIC
@ATTRIBUTE Wind_Dir_NE_PERC NUMERIC
@ATTRIBUTE Wind_Dir_E_PERC NUMERIC
@ATTRIBUTE Wind_Dir_SE_PERC NUMERIC
@ATTRIBUTE Wind_Dir_S_PERC NUMERIC
@ATTRIBUTE Wind_Dir_SW_PERC NUMERIC
@ATTRIBUTE Wind_Dir_W_PERC NUMERIC
@ATTRIBUTE Wind_Dir_NW_PERC NUMERIC
@ATTRIBUTE Wind_Dir_LAST {N, NE, E, SE, S, SW, W, NW}
@ATTRIBUTE Surf {0,1}
@DATA
18-11-2010,15.6,3.9,8.75,5.8,0.0,25.0,0.0,75.0,0.0,0.0,0.0,0.0,NE,0
19-11-2010,15.6,11.7,14.13,15.6,0.0,50.0,50.0,0.0,0.0,0.0,0.0,0.0,E,1

Figure 2: SPPAM-transformed dataset

The original dataset (Figure 1) shows an example of two
days of observation of weather and sea conditions for surf 
practice with 4 attributes and 8 instances. The first attribute 
Date is of type String and will be our aggregation pivot at- 
tribute, the second is of type numeric, the third attribute is 
nominal (with eight possible values) and the last attribute (the 
class) is binary. Our goal is to aggregate all observations 
within a day in one single record. 

The transformed dataset for this example has 2 data rows
for two observation days (18-11-2010 and 19-11-2010), each
with 15 attributes. The first attribute is the date. The next
4 attributes are numeric values corresponding to the maximum,
minimum, average and last values of the Wind_Knots
attribute. The following 8 numeric values correspond to the
frequencies (as percentages) of each nominal value of the
attribute Wind_Dir. The following attribute (14) is a nominal
value representing the last observed value for the Wind_Dir
attribute. The last attribute is the last value of the class
attribute for the group. The number of instances of the
transformed dataset drops to only 2, given that we had only
2 complete days of observation in the original table. For
attribute 2, Wind_Knots, the first day has a maximum value
of 15.6, a minimum of 3.9, an average of 8.75 and a last
observed value of 5.8. The same would apply to any other
numeric attribute. Attribute Wind_Dir in the original dataset
unfolds into nine attributes in the transformed data, because
it is nominal and has eight values plus the last observed
value. The first unfolded value corresponds to the frequency
of occurrence of the first value in that observation group,
the second to the frequency of occurrence of the second
value, and so on.
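The numbers quoted for the first observation day can be checked with a few lines of Python. The values below are the four 18-11-2010 observations listed in Figure 1; the variable names are illustrative only, and the average is rounded to two decimals as in Figure 2.

```python
from collections import Counter

# Observations for 18-11-2010, in the order listed in Figure 1.
knots = [15.6, 9.7, 3.9, 5.8]
wind_dir = ["SE", "SE", "SE", "NE"]
domain = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

# Maximum, minimum, average and last value of the numeric attribute.
numeric_part = [max(knots), min(knots), round(sum(knots) / len(knots), 2), knots[-1]]

# Percentage of observations taking each nominal value, in domain order.
counts = Counter(wind_dir)
nominal_part = [100.0 * counts[v] / len(wind_dir) for v in domain]

# Aggregated attributes 2 to 14 of the transformed record.
row = numeric_part + nominal_part + [wind_dir[-1]]
print(row)
```

The resulting list corresponds to attributes 2 to 14 of the first data row in Figure 2; the date and the class value complete the record.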

The total number of attributes in the transformed dataset is
given by the equation

$$1 + s + 4n + \sum_{i=1}^{w} \bigl( V(w_i) + 1 \bigr) \qquad (1)$$

where s is the number of String attributes in the original
dataset, n is the number of numeric attributes in the original
dataset, w is the number of nominal attributes and the
function V(w_i) is the number of values of the i-th nominal
attribute w_i, for all non-class attributes.
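Equation (1) can be evaluated with a one-line function. In this sketch the leading 1 is read as the class attribute, which is an interpretation consistent with the example of Figures 1 and 2 rather than something stated explicitly; the function name is hypothetical.

```python
def transformed_attribute_count(n_string, n_numeric, nominal_sizes):
    """Evaluate Equation (1): 1 + s + 4n + sum(V(w_i) + 1).

    nominal_sizes lists V(w_i), the number of values of each
    non-class nominal attribute.
    """
    return 1 + n_string + 4 * n_numeric + sum(v + 1 for v in nominal_sizes)


# Example of Figure 1: Date (String), Wind_Knots (numeric) and
# Wind_Dir (nominal with 8 values), plus the class.
print(transformed_attribute_count(1, 1, [8]))  # 15
```

The result matches the 15 attributes of the transformed dataset in Figure 2.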

The number of records in the transformed dataset is equal
to the number of distinct ids in the original dataset. In
our example, the id is the date attribute.

After this preprocessing task, the second step is to feed the 
new table (transformed dataset) to a machine learning algo- 
rithm, like any other dataset. 

Although we are dealing with meteorological data, the
method described above is fully applicable to any kind of
relational data where various records are related to the same
individual.

3 Methodology and Applications 

We applied our algorithm to two datasets. The first one is the
Surf - Praia Grande dataset, which has 10 attributes: 5
numeric, 4 nominal and 1 string. This dataset contains four
daily observations of wind and sea conditions taken at the
Praia Grande beach, Portugal, between November 18th 2010
and January 6th 2011, for a total of 192 instances. The 10
attributes are: date, hour, total sea height, wave height, wave
direction, wind wave height, wind speed, wind direction,
water temperature and wave set quality for surf practice. This
last attribute is our class, which can take 2 different values,
0 and 1, where 0 means that the weather and sea conditions are
not good for surf practice, and 1 means that there are good
conditions to surf.

The second dataset is the Surf - Aljezur and it has the same 
structure (data were collected at the same period of time as 
Praia Grande). The attributes and number of instances are the 
same as the Surf - Praia Grande. 

Table 1 shows the detailed structure of the original datasets
to be transformed by the SPPAM algorithm. A summary of
the transformations on both datasets is shown in Table 2.

For both datasets, the number of attributes generated by
SPPAM is 44 (which follows from Equation 1) and the number
of instances is 48 (the number of different observation
days).

Table 1: Original Surf - Praia Grande and Original Surf -
Aljezur attributes

Attribute           Type              Values
Date                String
Hour                Nominal           0, 6, 12, 18
Wave Total          Numeric
Wave                Numeric
Wave Direction      Nominal           N, NE, E, SE, S, SW, W, NW
Vaga                Numeric
Wind Speed          Numeric
Wind Direction      Nominal           N, NE, E, SE, S, SW, W, NW
Water Temperature   Numeric
Sets                Nominal (Class)   0, 1

Table 2: Original and SPPAM-transformed datasets summary

Dataset          # Instances   Class = 0    Class = 1
Sintra           192           75 (39%)     117 (61%)
Sintra SPPAM     48            18 (38%)     30 (62%)
Aljezur          192           48 (25%)     144 (75%)
Aljezur SPPAM    48            9 (19%)      39 (81%)

After applying our algorithm to the datasets, we performed
learning experiments using the WEKA tool, developed at
Waikato University, New Zealand [2]. The experiments were
performed in WEKA using the Experimenter module, where
we set several parameters, including the statistical
significance test and confidence interval, and the algorithms
we wanted to use (we used OneR as reference, ZeroR, PART,
J48, SimpleCart, DecisionStump, Random Forests, SMO,
Naive Bayes, Bayes with TAN, NBTree and DTNB). The
WEKA Experimenter produces a table with the performance
metrics of all algorithms with an indication of statistical
differences, using one of the algorithms as a reference. The
significance tests were performed using the standard corrected
t-test with a significance level of 0.01. The parameters used
for the learning algorithms are the WEKA defaults. For all
experiments we used 10-fold stratified cross-validation and
report results for the test sets.
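For readers unfamiliar with the corrected t-test, the sketch below assumes it refers to the corrected resampled t-test commonly used to compare two classifiers over cross-validation folds; WEKA's exact implementation is not reproduced here. The function name and the sample differences are hypothetical.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(differences, n_train, n_test):
    """Corrected resampled t-statistic for paired per-fold differences.

    differences : per-fold score differences between two classifiers
    n_train     : number of training instances in each fold
    n_test      : number of test instances in each fold
    """
    k = len(differences)
    d_bar = mean(differences)
    var = variance(differences)  # sample variance (divides by k - 1)
    # The n_test / n_train term inflates the variance to compensate for
    # the overlap between training sets across folds.
    return d_bar / math.sqrt((1.0 / k + n_test / n_train) * var)


# Hypothetical differences from three folds with a 9:1 train/test ratio.
t = corrected_resampled_t([1.0, 2.0, 3.0], n_train=9, n_test=1)
```

The statistic is compared against a Student t distribution with k - 1 degrees of freedom; the sketch does not handle the degenerate case where all differences are equal.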

4 Results 

We compared the results obtained in WEKA using our
preprocessing method SPPAM with the results obtained with the
original datasets. In Tables 3 and 4 we present the
performance obtained by the WEKA algorithms for both the
original dataset and the SPPAM-transformed dataset for Surf -
Praia Grande and Surf - Aljezur. We show the results obtained
for Percentage of Correctly Classified Instances (CCI), Kappa
Statistic (Kappa), Precision (Precis.), Recall and F-Measure
(F-Meas.). We show the performance for each class and the
averaged performance for both classes. Our best results with
SPPAM are highlighted in both tables. We also present charts
showing the average gain in correctly classified instances
between the SPPAM datasets and the original datasets for all
classification algorithms.
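The metrics listed above can all be derived from a binary confusion matrix. The following sketch uses their standard definitions for the positive class; the function name and the counts in the example are hypothetical and are not taken from the experiments.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute CCI, Kappa, Precision, Recall and F-Measure.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives and true negatives for the positive class.
    """
    total = tp + fp + fn + tn
    cci = (tp + tn) / total  # fraction of correctly classified instances
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Kappa compares observed agreement with agreement expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (cci - expected) / (1 - expected)
    return {
        "cci": cci,
        "kappa": kappa,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical confusion matrix with 100 instances.
metrics = binary_metrics(tp=40, fp=10, fn=10, tn=40)
```

The sketch does not guard against empty classes (zero denominators), which would need explicit handling on very small or skewed test sets.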

4.1 Praia Grande dataset results 

For this particular dataset, our best results were obtained
using Bayesian Networks (using the TAN and K2 search
algorithms), Naive Bayes and DTNB, as shown in Table 3.
Naive Bayes is the algorithm that yields the best performance
when training with the SPPAM-transformed datasets, for
every metric.

In Figure 3 we show graphically the differences between
the correctly classified instances percentage average for the 
original dataset and the SPPAM-preprocessed dataset for the 
Praia Grande data for all the machine learning algorithms we 
tested. The values are in percentage. 



Table 3: Transformed Surf - Praia Grande results (Original
Dataset vs. SPPAM transformed dataset)

Figure 3: Delta between average CCI% Praia Grande and
Praia Grande SPPAM



4.2 Aljezur dataset results 

With this dataset we achieved even better results. The use of
SPPAM before the training task improved the classification
performance on almost all analyzed metrics. In some cases,
we get a 10% gain in the percentage of correctly classified
instances. These results were statistically significant for
BayesNet using K2 and TAN, SimpleCart, ZeroR, SMO and DTNB.
For the metrics where the results were not improved (Naive
Bayes and DTNB), the difference is not significant.

In Figure 4 we show the difference between the averages
of correctly classified instances percentage for the original
dataset and the SPPAM-preprocessed dataset for the Aljezur
dataset. Here we can see graphically how much better, on
average, the classification algorithms can classify new
instances using our method. The values are also in percentage.

Table 4: Transformed Surf - Aljezur results

5 Conclusions and Future Work

In this work, we proposed a simple, general solution to the
problem of learning classifiers for multiple correlated data
such as multiple exams of a single patient, internet customer
preferences, weather forecast, sea prediction, among others.
SPPAM, a Statistical Preprocessing AlgorithM, takes the
original dataset containing related data, and produces a new
dataset with all correlated data aggregated using metrics such
as maximum, minimum, average, etc. We tested SPPAM on
two datasets that contain records associated according to a
date. We used WEKA to train on the original datasets and on
the SPPAM-transformed dataset. Our results indicate that the
SPPAM transformation can produce better classifiers than the
ones trained on the original dataset.

In its present form, SPPAM has already shown its potential,
but we have been working on modifications to the basic
algorithm in order to improve performance even further. We also
have been working on applying SPPAM to medical datasets
that contain multiple records for a single patient.

References

land. Proceedings of World Academy of Science, Engineering
and Technology, 31(248-253), 2008.

[5] L.L. Lai, H. Braun, Q.P. Zhang, Q. Wu, Y.N. Ma, W.C. Sun, and
L. Yang. Intelligent weather forecast. Proceedings of 2004
International Conference on Machine Learning and Cybernetics,
7:4216-4221, 2004.